Abstract
Background Recent advances in artificial intelligence (AI), particularly in large language models (LLMs) such as GPT-4o (ChatGPT) and others, have shown impressive performance in medical domains, including passing licensing examinations and, in some cases, surpassing physicians in general diagnostic and reasoning tasks. However, their reliability and clinical utility in highly specialized, real-world medical settings, such as hematology diagnostics and therapy, have not been rigorously evaluated. Malignant hematology poses unique challenges because of its complex pathophysiology, layered diagnostic frameworks, and the need for nuanced, high-stakes clinical decision-making that typically requires highly specialized physicians, making it an ideal testbed for assessing the true capabilities and limitations of these models.
Objectives To evaluate how well state-of-the-art LLMs handle real-world hematology cases, focusing on their ability to make accurate diagnoses, predict outcomes, follow treatment guidelines, and suggest relevant clinical trials.
Method We developed a test set of 30 complex, real-world clinical cases of myelodysplastic syndromes (MDS). We chose MDS as a representative hematologic malignancy because of its diagnostic complexity and need for expert subspecialty care. Each case required integration of clinical, morphological, cytogenetic, and molecular data, mirroring real-life decision-making in hematology. A standardized prompt was used to query multiple LLMs: ChatGPT (GPT-4o and GPT-o3), Claude, and DeepSeek. Models were tasked with providing a diagnosis per WHO 2022/ICC criteria, calculating IPSS-R/IPSS-M risk scores, and recommending appropriate treatment and clinical trials. Responses were independently reviewed by a blinded panel of eleven international MDS experts, who scored them on diagnostic accuracy, prognostic assessment, and treatment relevance using a 1–5 Likert scale, with a score ≥ 4 considered correct per expert opinion. Factual errors were also categorized as none, minor, or major. To evaluate the consistency of expert ratings, we used the intraclass correlation coefficient (ICC) to measure agreement on numerical scores and Cohen's κ (kappa) to assess agreement in identifying errors.
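As a rough illustration of the agreement analysis described above (a minimal sketch, not the study's actual code), the example below computes an ICC for the Likert scores and an average pairwise Cohen's κ for the error categories. The file name, column names, and the choices of ICC(2,1) and pairwise κ averaging are assumptions made for the example only.

```python
# Sketch of inter-rater agreement analysis (hypothetical data layout).
# Assumes a long-format table with one row per (model response, expert rater):
#   response_id, rater, likert_score, error_category (none/minor/major)
from itertools import combinations

import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

ratings = pd.read_csv("expert_ratings.csv")  # hypothetical file and schema

# ICC for numerical scores; ICC(2,1) (two-way random effects, absolute
# agreement, single rater) is one common choice for a fixed expert panel.
icc = pg.intraclass_corr(
    data=ratings, targets="response_id", raters="rater", ratings="likert_score"
)
print(icc.set_index("Type").loc["ICC2", "ICC"])

# Agreement on error categories: Cohen's κ is pairwise, so with eleven raters
# one option is to average κ over all rater pairs.
wide = ratings.pivot_table(
    index="response_id", columns="rater", values="error_category", aggfunc="first"
)
kappas = [
    cohen_kappa_score(wide[a], wide[b]) for a, b in combinations(wide.columns, 2)
]
print(sum(kappas) / len(kappas))
```

With more than two raters, Fleiss' κ is another common option; since the abstract reports Cohen's κ, the sketch shows averaged pairwise κ under that assumption.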
Results The highest-performing model was GPT-o3, achieving 58% agreement with expert clinical assessments, followed by GPT-4o (42%), DeepSeek (31%), and Claude (26%). On the 1–5 scale, mean expert-assigned scores (overall; diagnosis / prognosis / treatment / clinical trials) were: GPT-o3 3.48 (3.68 / 3.58 / 3.56 / 3.09), GPT-4o 3.22 (3.14 / 3.20 / 3.39 / 3.16), DeepSeek 2.98 (2.92 / 3.01 / 3.15 / 2.83), and Claude 2.86 (2.72 / 2.90 / 3.09 / 2.73).
Major factual errors (hallucinations) were frequent across all models, each exceeding a 25% rate: GPT-o3 and GPT-4o (both 26%), DeepSeek (33%), and Claude (36%). Minor factual error rates were similarly high: GPT-o3 and Claude (47%), DeepSeek (49%), and GPT-4o (52%).
Experts showed strong agreement in their evaluations, with high consistency in scoring (ICC = 0.81) and in identifying AI errors or hallucinations (κ = 0.76), confirming the reliability of the review process.
Conclusion Despite recent reports suggesting that models like ChatGPT have outperformed physicians in diagnostic accuracy and clinical decision-making, current state-of-the-art LLMs underperform in the highly specialized, complex clinical scenarios posed by hematologic malignancies. Even advanced reasoning models such as GPT-o3 fell short of expert expectations. These findings underscore that general-purpose LLMs are not yet suitable for autonomous clinical use in hematology. Their deployment should be approached with caution, and further research is essential to rigorously evaluate their performance across all subdomains of hematology.